Vision-Language Pretraining (VLP) and foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, remains little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook (Structured Vision-Language Pretraining for Computational Cooking), first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. Finally, we conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. The code will be made publicly available.
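The Recall@1 figure quoted above is a standard retrieval metric: the fraction of queries whose correct match is ranked first. The following is a minimal sketch of how such a metric can be computed, assuming a square similarity matrix in which query i is paired with candidate i. The function name `recall_at_k` and the toy scores are illustrative assumptions, not taken from the VLPCook code or results.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score between query i (e.g. an image) and
    candidate j (e.g. a recipe). Assumption: the correct match for
    query i is candidate i, as in paired image-recipe evaluation.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by decreasing similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked first
    [0.2, 0.4, 0.8],  # query 1: correct candidate ranked second
    [0.1, 0.2, 0.7],  # query 2: correct candidate ranked first
]

print(recall_at_k(toy_scores, k=1))
print(recall_at_k(toy_scores, k=2))
```

On benchmarks such as Recipe1M, retrieval metrics are usually reported over sampled candidate pools and in both directions (image to recipe and recipe to image); the sketch covers a single direction on a single pool.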
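Vision and language pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. This computational headlong rush does not seem reasonable in the long term as a path toward sustainable solutions, and it de facto excludes academic laboratories with limited resources. In this work, we propose a new framework, dubbed VICHA, that efficiently exploits the input data to boost learning by: [...], (c) leveraging image-level annotations, called Visual Concepts, obtained with existing foundation models (e.g. CLIP), to boost the performance of the image encoder. Although pretrained on four times less data, our VICHA strategy outperforms other approaches on downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding. The code will be made publicly available here: https://github.com/mshukor/vicha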
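In this paper, we propose a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted into the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flow model, in which an entropy model can be optimized for image coding. In addition, we propose a new perceptual loss that is more efficient than its counterparts. Finally, an entropy model for video inter coding is also learned in the previously constructed latent representation. Our method (SGANC) is simple, faster to train, and achieves better results for image and video coding than state-of-the-art codecs such as VTM and AV1, as well as recent deep learning techniques. In particular, it drastically reduces perceptual distortion at low bit rates.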
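Generative Adversarial Networks (GANs) have proven to be very effective for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled nature of the latent space. In this paper, we identify that facial attribute disentanglement is not optimal, so facial editing that relies on linear attribute separation is flawed. We therefore propose to improve semantic disentanglement with supervision. Our method consists of learning a proxy latent representation using normalizing flows, and we demonstrate that this yields a more effective space for face image editing.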
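Building classification using satellite imagery is becoming increasingly important for several applications such as damage assessment, resource allocation, and population estimation. In this work, we focus on building damage assessment (BDA) and building type classification (BTC) of residential and non-residential buildings. We propose to rely solely on RGB satellite imagery and follow a two-stage deep learning-based approach, in which building footprints are first extracted using a semantic segmentation model, followed by classification of the cropped images. Due to the lack of an appropriate dataset for residential/non-residential building classification, we introduce a new dataset of high-resolution satellite imagery. We conduct extensive experiments to select the best hyperparameters, model architecture, and training paradigm, and we propose a new transfer learning-based approach that outperforms classical methods. Finally, we validate the proposed approach on the two applications, showing excellent accuracy and F1-score metrics.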
Wind power forecasting supports power system planning by increasing the level of certainty in decision-making. Due to the randomness inherent to meteorological events (e.g., wind speeds), making highly accurate long-term predictions for wind power can be extremely difficult. One approach to remedy this challenge is to utilize weather information from multiple points across a geographical grid to obtain a holistic view of the wind patterns, along with temporal information from the previous power outputs of the wind farms. Our proposed CNN-RNN architecture combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract spatial and temporal information from multi-dimensional input data to make day-ahead predictions. In this regard, our method incorporates an ultra-wide learning view, combining data from multiple numerical weather prediction models, wind farms, and geographical locations. Additionally, we experiment with global forecasting approaches to understand the impact of training the same model on datasets obtained from multiple different wind farms, and we employ a method where spatial information extracted from convolutional layers is passed to a tree ensemble (e.g., Light Gradient Boosting Machine (LGBM)) instead of fully connected layers. The results show that our proposed CNN-RNN architecture outperforms other models such as LGBM, Extra Tree regressor and linear regression when trained globally, but fails to replicate such performance when trained individually on each farm. We also observe that passing the spatial information from the CNN to LGBM improves its performance, providing further evidence of the CNN's spatial feature extraction capabilities.
Recent advances in deep learning have enabled us to address the curse of dimensionality (COD) by solving problems in higher dimensions. A subset of such approaches of addressing the COD has led us to solving high-dimensional PDEs. This has resulted in opening doors to solving a variety of real-world problems ranging from mathematical finance to stochastic control for industrial applications. Although feasible, these deep learning methods are still constrained by training time and memory. Tackling these shortcomings, Tensor Neural Networks (TNN) demonstrate that they can provide significant parameter savings while attaining the same accuracy as compared to the classical Dense Neural Network (DNN). In addition, we also show how TNN can be trained faster than DNN for the same accuracy. Besides TNN, we also introduce Tensor Network Initializer (TNN Init), a weight initialization scheme that leads to faster convergence with smaller variance for an equivalent parameter count as compared to a DNN. We benchmark TNN and TNN Init by applying them to solve the parabolic PDE associated with the Heston model, which is widely used in financial pricing theory.
In this manuscript, we present a novel method for estimating the stochastic stability characteristics of metastable legged systems using the unscented transformation. Prior methods for stability analysis in such systems often required high-dimensional state space discretization and a broad set of initial conditions, resulting in significant computational complexity. Our approach aims to alleviate this issue by reducing the dimensionality of the system and utilizing the unscented transformation to estimate the output distribution. This technique allows us to account for multiple sources of uncertainty and high-dimensional system dynamics, while leveraging prior knowledge of noise statistics to inform the selection of initial conditions for experiments. As a result, our method enables the efficient assessment of controller performance and analysis of parametric dependencies with fewer experiments. To demonstrate the efficacy of our proposed method, we apply it to the analysis of a one-dimensional hopper and an underactuated bipedal walking simulation with a hybrid zero dynamics controller.
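To make the central tool concrete, here is a minimal sketch of the unscented transformation for a scalar random variable, using the classic symmetric sigma-point form with a single scaling parameter. The function name `unscented_transform_1d`, the default `kappa=2.0`, and the example maps are assumptions for illustration; the legged systems studied in the abstract are higher-dimensional and the authors' implementation may differ.

```python
import math


def unscented_transform_1d(mean, var, f, kappa=2.0):
    """Estimate the mean and variance of f(x) for a scalar x.

    Uses three sigma points: the mean and two points placed
    symmetrically at sqrt((n + kappa) * var) from it, with n = 1.
    kappa is a tuning parameter (assumed value, not from the paper).
    """
    n = 1  # state dimension in this scalar sketch
    spread = math.sqrt((n + kappa) * var)
    sigma_points = [mean, mean + spread, mean - spread]
    # Weights sum to 1: kappa/(n+kappa) for the centre, 1/(2(n+kappa)) otherwise.
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]

    # Propagate each sigma point through the (possibly nonlinear) map.
    outputs = [f(x) for x in sigma_points]

    out_mean = sum(w * y for w, y in zip(weights, outputs))
    out_var = sum(w * (y - out_mean) ** 2 for w, y in zip(weights, outputs))
    return out_mean, out_var


# Hypothetical usage: propagate x with mean 1 and variance 4 through 2x + 1.
lin_mean, lin_var = unscented_transform_1d(1.0, 4.0, lambda x: 2.0 * x + 1.0)
print(lin_mean, lin_var)
```

Only three evaluations of `f` are needed here, rather than a dense sweep over initial conditions, which is the kind of saving the abstract refers to. For an affine map the estimate reproduces the exact output mean and variance up to floating-point rounding.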
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
Deep learning can extract rich data representations if provided with sufficient quantities of labeled training data. For many tasks, however, annotating data has significant costs in terms of time and money, owing to the high standards of subject matter expertise required, for example in medical and geophysical image interpretation tasks. Active learning can identify the most informative training examples for the interpreter to annotate, leading to higher efficiency. We propose an active learning method based on jointly learning representations for supervised and unsupervised tasks. The learned manifold structure is later used to identify the informative training samples most dissimilar from the learned manifold, based on the error profiles on the unsupervised task. We verify the efficiency of the proposed method on a seismic facies segmentation dataset from the Netherlands F3 block survey, significantly outperforming contemporary methods to achieve the highest mean Intersection-Over-Union value of 0.773.
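For reference, the mean Intersection-Over-Union metric reported above can be sketched as follows, assuming flattened per-pixel class labels and averaging only over classes that appear in the prediction or the ground truth. The function name `mean_iou` and the toy labels are illustrative assumptions and are unrelated to the F3 dataset.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union across classes.

    pred and target are equal-length sequences of integer class labels
    (e.g. a flattened segmentation map). Classes absent from both are
    skipped; at least one class is assumed to be present.
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth equal class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth equals class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy labels for six pixels and three classes (made-up data).
toy_pred = [0, 0, 1, 1, 2, 2]
toy_target = [0, 1, 1, 1, 2, 0]
print(mean_iou(toy_pred, toy_target, num_classes=3))
```

Evaluation protocols differ on how absent classes are handled and on whether IoU is accumulated per image or over the whole test set, so a reported value is only comparable under the same convention.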